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Abstract. Bayesian methods - either based on Bayes Factors or BIC - are now widely 
used for model selection. One property that might reasonably be demanded of any model 
selection method is that if a model Mi is preferred to a model Mo, when these two models 
are expressed as members of one model class M, this preference is preserved when they are 
embedded in a different class M'. However, we illustrate in this paper that with the usual 
implementation of these common Bayesian procedures this property does not hold true 
even approximately. We therefore contend that to use these methods it is first necessary 
for there to exist a "natural" embedding class. We argue that in any context like the 
one illustrated in our running example of Bayesian model selection of binary phylogenetic 
trees there is no such embedding. 

1. Introduction 

Bayesian method such as Bayes Factor (BF) (for example Denison et al. (2002)), or ones 
based on the Bayesian Information Criterion (BIC) Schwarz (1978) are now widespread in 
statistical analysis. In this paper we show that a disadvantage of these approaches, in the 
way these are routinely used, is that they can give rise to an awkward inferential ambiguity, 
which we later argue for some problems can never be resolved. 

Selecting a model can be seen as a special case of providing a preference order over 
a number of options, which we assume to be finite. A long time ago Nash (1950) and 
Thomas (1984) argue that a particular property - restated below in terms that apply for 
model selection - is essential for preference orderings used to identify an optimal choice. 
This states that if Mi is preferred to Mo (written M\ y Mo) in the model class M then 
Mi >- Mq also in M' := M U M+ whenever Mi >- M+ for all M+ G M+. So in particular 
if M* is a best model in M and we extend our selection to contain other models all of 
which are worse than M*, then M* remains a best model in the larger set of models. This 
natural property is called independence of irrelevant alternatives (HA). 

If IIA does not hold, then, whether or not we include poor fitting models in the selection 
set will influence what model we label as "best". Why should the choice of this model be 
influenced by inclusion or exclusion of other candidate models later discovered to be poor 
representatives of the underlying data generating process? Surely any routine method of 
Bayesian Model selection: for example such as those reviewed by Bernardo &: Smith (1994) 
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and Key et al. (1999) - even ones assuming that all models were wrong - should be expected 
not to violate IIA. 

Happily most applications of a naive Bayes Factor (BF) model selection satisfy the 
property of IIA in the following sense. Let X be the random vector of observations over 
which models are selected, taking values x in the sample space X C M m . For simplicity 
assume that all predictive densities associated with these models are bounded and strictly 
positive over their shared support. Let Mq be a reference model with predictive density 
Pq(x). Let Mff, be another candidate model with predictive density p^(x) on x G X and a 
prior odds relative to Mq of A^, where (p^(x), A^) are functions of <f> alone and in particular 
not the predictive density and prior model probability (p^/(x), A^') for any other candidate 
model M^r. Then setting Ao = 1, we note immediately that y Mfa if and only if for 
the value of x we observe 

logp^ (x) + log A^ 2 > logp^ (x) + log A 01 . 

Clearly this preference is therefore unaffected by the values of (p^/(x), A^/) for which 
may or may not be contained in the selection class. 

Our problems begin when the class of models M over which selection takes place is 
extremely large. Then the necessary task of carefully and individually choosing separate 
prior distributions over the hyperparameters of each candidate model is clearly infeasible. 
We are therefore forced to reference our choice of prior density over the parameters of each 
candidate so that in some sense these are consistent with each other. In this situation 
(p^(x), A^) is highly related with (p^/(x), X^>) for other candidate models and the BF may 
break the IIA property. 

For example this is exactly what happens when selecting over the class of Bayesian 
networks (BNs). In this case Heckerman et al. (1995) introduced an additional demand 
that the prior densities satisfied parameter modularity. Here when parts of the structure 
of two of these multivariate models coincide, then the priors over this shared structure in 
these two models are assumed to be the same. This assumption, and others like it, not only 
makes the assignment of priors across a large model class feasible but also makes it possible 
to use greedy search algorithms to efficiently search the space for the best candidate model. 

However, these methods come at a price. Because prior densities are chosen to match 
those given within the model class, the choice of class itself can affect the inference and 
in particular IIA can be violated. Of course occasionally, in simple applications, there are 
compelling reasons why a particular model class should be used. Then the violation of IIA 
is not a problem. However, the choice of embedding class is often chosen for convenience 
or convention rather than for some phenomenological reason associated with the modelled 
process. It is in these circumstances that the violation of IIA gives rise to poor inference. 

One such setting occurs when the modeler must decide whether or not to include vari- 
ables which usefully explain the process but cannot be observed. Within the Bayesian 
paradigm the fact that these variables are not observed does not matter in any formal 
sense: the score will simply be the log-marginal density of the observed variables where we 
marginalize over the missing ones. Thus this missingness causes no methodological prob- 
lems. Indeed, if these integrations cannot be executed in closed form then their score can 
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be calculated in a straightforward manner using now standard numerical techniques. The 
problem is rather that two statistically equivalent models will be treated differently de- 
pending on whether or not the variables representing these underlying causes are included. 

To be more concrete we consider the class of phylogenetic tree models. The evolution 
of a collection of different extant species is typically represented by a Bayesian network 
on a directed tree, possibly with some additional constraints on the parameter vector (see 
for example Semple & Steel (2003), Yang (2006)). The extant species are represented by 
the leaves of a tree, whose interior vertices label extinct ancestors. The model is usually 
depicted as a tree with edge lengths like for example in Figure 1. The topology of the tree 
represents the underlying graphical model. The lengths are functions of the conditional 
probabilities parametrising the model and in the phylogenetic context they give a measure 
of the phylogenetic distance between two species. 




Figure 1. A simple directed phylogenetic tree. 

In its simplest form each variable on the tree is binary and represents the presence or 
absence of a characteristic in a large class of genetic locations, believed to occur at random 
with a particular probability determined by the vertex. The usual evolutionary hypotheses 
tell us that, if the tree is valid, then collections of variables separated from each other by any 
internal vertex are independent of each other given the value of that vertex. Our problem 
is to select an evolutionary tree that gives the most plausible evolutionary explanation of 
the data we have observed on the extant species. For any tree there is a formal Bayesian 
selection method to do this. We simply assign a prior density to each parameter, using 
methods such as those described by Heckerman (1999) respecting parameter modularity, 
calculate the corresponding log marginal likelihood score marginalising over the hidden 
variables and choose the tree scoring the highest. 

The problem occurs because it can be proved (see for example Settimi &; Smith (2000)) 
that a tree is statistically the same collection of sample densities over observed vertices 
of the graph as a simpler tree if it contains an interior vertex with only two neighbours. 
A common procedure is therefore to restrict the class of tree models to include only trees 
whose interior vertices have 3 or more neighbours. However, if the ensuing inference were 
to depend on the associated hypothesis that a hereditary ancestor existed only if it had 
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at least two non-extinct associated species is surely not satisfactory for two reasons. First, 
it seems quite conceivable that, in fact, there is only one direct descendant still surviving 
from a particular species. Second, even if we accepted the above, from an inferential point 
of view we should note that the property of having no degree two vertices (i.e. with two 
neighbours) is not closed under taking margins over a subset of the set of leaves of a tree. 
By this we mean that a subtree of the original tree will typically have degree two vertices. 

The fact that a marginal tree contains degree two vertices has important practical impli- 
cations whenever we want to include an additional species in our data set. This happens for 
example in the procedure of outgroup rooting when an outgroup is added in phylogenetic 
analysis in order to find the root of an unrooted phylogenetic tree as described by (Yang, 
2006, Section 3.1.1.2). For a simple example imagine that the undirected tree on the left of 
Figure 2 has been chosen for four extant species. To find the root of this tree an outgroup 
5 has been added. Let say the tree on the right side of Figure 2 has been found for the 
augmented data set. We have now introduced an additional hidden vertex c between 3 and 
4. In this illustrated case the marginal model over {1, 2, 3, 4} in the tree on the right-hand 
side coincides with the original model on the left. However, as we show in this paper, if 
we use automatic model selection methods, it is not clear that the best marginal model for 
{1,2,3,4} in the augmented data set will be the original model; and this is the minimal 
requirement for the outgroup rooting method to work in a consistent way. 




Figure 2. Including an outgroup may distort the analysis. 



In examples like the one given by phylogenetic tree models both standard BF and also 
BIC model selection methods do not in general respect the IIA property. In this paper 
we examine this phenomenon in much more detail with reference to the simplest possible 
manifestation of this ambiguity in Section 2. A natural question to ask is then whether these 
model selection methods are at least approximately invariant to this choice of embedding 
class (as appears to be the case of Consonni & La Rocca (2011) when addressing a rather 
different issue). Sadly the answer to this question is no! The embedding class can have a 
critical impact on the model selection even in the simplest cases. In Section 3 we analyse 
our basic example in a full Bayes factor model selection with conjugate priors, which is a 
default method in numerous R packages and Hugin software. In Section 4 we also show 
that the parametrization ambiguity also applies to BIC model selection and provide an 
in-depth geometrical explanation of this phenomenon. We show again that the embedding 
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class can have a critical impact on the model selection. We end the paper with a short 
discussion of the more general implications of this phenomenon. 

2. The running example 

In our discussion we first want to make a clear distinction between different notions of 
statistical models used in this paper. A model for a random variable X is a family of distri- 
butions of X. A parametric model is a parametric family of probability distributions 
of X together with the defining parametrization ip, which is a map from the given finite 
dimensional parameter space to the space of all probability distributions. We say that 
a model M has a parametric formulation if there exists a parametrization defining this 
model. Let M^, be two parametric models with parametrisations (ft, tp and parame- 
ter sets 0, f2 respectively. We say that and are parametric formulations of the 
same model if <^>(0) = as models. Finally, a Bayesian model is a parametric model 

together with an associated family of prior distributions on the parameter space 0. Hence 
the same model can have many parameteric formulations and each, when combined with 
the associated prior distribution over parameters, can lead to a different Bayesian model. 

To demonstrate the problem described in the introduction we will focus on a comparison 
of two simple models for a vector of two binary random variables X and Z under two 
different parametric formulations. The first model is the saturated model and the second 
is the model of independence X ALZ. The motivation is as follows. Suppose that we 
have two hypotheses: Hq that X and Z are unrelated (XALZ) and Hi that there are 
evolutionary related (X — > Z). What we will demonstrate is that - with routine model 
settings - the second hypothesis must be distinguished from the one where we have an 
evolutionary relationship of the form X — > Y — > Z, where the ancestor of X and the 
predecessor of Z has not been observed. This is so even though both lead to exactly the 
same model for (X, Z) . The problem is that when we refer to evolutionary relation we 
certainly mean both X — > Z, X — > Y — > Z and even X — )■ Y\ — > Yjj — ► Z simultaneously. 
The issue we have here is that therefore from a Bayesian point of view, for model selection 
we need to add all those intermediate vertices that might have occurred between the two 
observed vertices. But how do we determine this number and why should it impinge on 
our choice? 

First, consider the "natural" parametric formulation where the saturated model is parametrised 
by the joint distributions and the independence model by the corresponding marginal distri- 
butions. Denote these parametric models by and respectively and the parameter 
spaces by 0^ O) = [0, l] 2 and ©f } = {9 XZ G R 4 : E£,-=oM«,i) = l.M*,i) > °}- Thus 
m[°> is given by p xz (i,j) = 9 xz (i,j) and M (0) by p xz (i,j) = 9 x (i)9 z (j) for i,j = 0, 1. The 
directed acyclic graphs (see Lauritzen (1996)) representing these models are given in Figure 
3. 

Alternatively, consider two other parametric models for (X,Z): model M± of condi- 
tional independence XALZ\Y, implying the saturated model on the (X,Z) margin and 
represented by a graph X — > Y — > Z; and its submodel of marginal independence of 
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Figure 3. The directed acyclic graphs of the saturated model and the 
independence model under the first parametric formulation. 

X and Z. Here we assume that Y is binary and not observed and the model is parametrized 
by the marginal distribution of X and conditional distributions of Y given X and of Z 
given Y . 

More generally by for k > 1 denote the parametric model for (X, Z) given by the 

graph X — > Y\ —>■•••—)■ Yk — )• Z, where all the Y« are assumed to be binary and hidden. 

(k) 

The corresponding marginal independence model is denoted by Mq and it is a submodel 

of My . The parameter spaces are denoted by 0q and 0^ . The parameters of M± 
are given by the marginal distribution of X and conditional probability for each arrow 
of the corresponding graph. There are exactly 1 + 2(k + 1) free parameters denoted by 
9 X (1), 6i\ x (l\i), 6 2 \i(l\i), 6 z \k(l\i) for i = 0, 1 where for example 6 X (1) = P(X = 1), 
6x\ x (l\0) = P(Y X = l\X = 0) and O^^Vfi) = P{Yj = l|Yj-_i = 0). It follows that 

ef) = [0) i]i+2(fc+i). 

By (Gilula, 1979, Corollary 1), M 1 (fe) for every k > 1 is the saturated model and hence 

it is equivalent to m[ . Since we also have that is equivalent to then for every 

(k) 

k > we compare the same models. Although the models Mq are the same, as parametric 

or Bayesian models they are very different. If k = 1 then M (1) is a union of two parametric 

submodels Y ALZ, X ALY of M± as depicted in Figure 4 and hence Gq 1 ^ is isomorphic to 

a subspace of 0$ given as a union of two intersecting components given by equations: 
0i|j;(l|O) — #i|a;(l|l) = and #^|i(l|0) — = 0. The common intersection locus is the 

singularity of 0Q 1 ' . More generally, the larger is k the more complicated and more singular 

(k) (k) 

is the embedding of the parametric model Mq in . This gives the geometric intuition 
why the model selection for large k may differ from k = 0. 

Mf" : 
M<" : 
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Figure 4. The saturated model and the independence model under the 
second parametric formulation with k = 1. 
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A statistical reason why IIA may fail follows from the discussion in the introduction. 

(k) (k) 

Thus, for fixed k > consider the class of Bayesian models M| and Mq together 
with the associated families of all prior distributions over the corresponding parameter 
spaces. First note that Mo can be naturally embedded into Mi because M (0) = M (1) and 
m[ = m[ as models and the parameter space can be associated with a subspace 
of @i for which X and Y are functionally related (e.g. a u(l|l) = ^^(OlO) = 1). Hence 
every prior on can be treated as a prior concentrated on this subspace. There is 

no statistical way of distinguishing between M± and its copy embedded in m[ . More 
generally Mfc_i C for all k > 1. This follows from the fact that every prior distribution 
for parameters of the Bayesian models in M&_i is a degenerate prior distribution of 
Hence, at least in principle, the statistical analysis should not dependent on a particular 
embedding. 

The problem is that the containment M^_i C M& breaks down whenever we assume 
some sort of regularity of the prior distributions ruling out some possible Bayesian models. 
In this case the analysis may highly depend on k. This is particularly evident in the case 
of the BIC criterion which is derived under the assumption that prior distributions are 
diffuse (bounded and bounded away from zero) and hence then cannot be degenerate. The 
problem with diffuse priors and the BIC criterion has been reported in other contexts. 
In particular it has been shown that BIC tends to support simpler models which follows 
from the Jeffrey-Lindley's paradox (see for example Denison et al. (2002)). Recently it has 
been shown by Johnson & Rossell (2010) and Consonni & La Rocca (2011), that using 
local priors causes in particular a very slow convergence to the true smaller model. In 
Section 4 we show that this problem may be particularly important when the choice of the 
parametric formulation is not clear. 



3. Finite sample conjugate selection 

Perhaps the most common way to set up the prior densities across a large class of 
graphical models is to ensure each decomposable model within the class has consistent 
priors and the same strength in the following sense. Let ir = (7r x ) denote the vector of 
prior probabilities that a unit from the model population takes level x £ X where X is the 
sample space of the problem. This joint prior distribution induces marginal and conditional 
distributions over subvectors of x. 

• Define /3 X = {3n x for x £ X, where the positive real scalar /3, called the effective 
sample size, reflects the number of observations the modeler believes her prior 
beliefs are worth. 

• Differentiate each model in the class with the appropriate hyper-Dirichlet prior (see 
Dawid & Lauritzen (1993)) faithful to its particular sets of conditional independence 
assumptions defined by its graph. The hyper-Dirichlet priors form the conjugate 
class for decomposable graphical models. They are widely used in numerous R 
packages and in Hugin (see Madsen et al. (2003)). 
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• Ensure the relevant prior expected clique cell probabilities and effective sample size 
parameters of these different product Dirichlets are consistent across the different 
models in the class. This is obtained by choosing the associated Dirichlet parame- 
ters /3j of the different (marginal or conditional) components Cj of BN model with 
prior probability 7Tj so that /3 i = fiiti. In this way these Dirichlet distributions 
over various components are consistent with those of a Dirichlet over a saturated 
model with parameters /3 X for x 6 X . This in particular assures that the parameter 
modularity holds. 

In what follows we therefore faithfully follow this procedure applying it to the different 
embeddings below, in addition ensuring that different embedding match in an analogous 
way. In this section we show that the selection procedure between these Bayesian models 
will not satisfy the IIA. 

First we compare the models for k = 0, 1. Let ir xyz be the joint prior distribution 
on (X,Y,Z). By ir xz , ir x , ir z , Tr y \ x and ix z \ y we denote the corresponding prior marginal 

and conditional probabilities. Let first k = 0. For we set the prior distribution to 

be the Dirichlet distribution, 8 Dirf^J, where (3 XZ = f3n xz and f3 > 0. For M^ 0) 
we have 9 X ~ Heta((3 x ) and 9 Z ~ Beta(/3 2 ), where (3 X = /3tt x and (3 Z = (3tt z . Let now 

k = 1. The standard conjugate analysis for requires setting hyper-Dirichlet priors 

for the joint distribution of (X, Y, Z). We set 9 X ~ Beta(/3 a; ), 9 y \ x (l\i) ~ Beta(/3 y | :r (i)) and 
0z\y{-\i) ~ Beta^^^)), where /3 y \ x (i) = Pir y \ x (-\i) and (3 z \ y (i) = Pir z \ y (-\i) for i = 0,1. 
If we assume that all five random variables 9 X , 9 y \ x (-\0), 9 y \ x (-\l), ^^(-lO) and 9 z \ y (-\l) 
are independent, then the variable p xy zt where p xy z (i,j,k) = 9 x (i)8 y \ x (j\i)9 z \ y (k\j), has a 
hyper-Dirichlet distribution. This induces a distribution of p xz by 



which is not in general Dirichlet as for M\ and so gives a different value of the marginal 
likelihood in this case. 

Since the induced distribution of p x and p z is the same as for k = 0, the difference in 
the analyses performed for the Bayes factor follows from the difference in the marginal 
likelihoods of the saturated models. For and we can easily obtain formulae for 
the posterior distribution and for the marginal likelihood functions of the sample counts 
u = [iiij]. So the Bayes factor can be calculated directly. For any table a = [ai\ define the 

Beta function B(a) = 5y^°'x where T(-) is the Gamma function. The marginal likelihood 



i 



(1) 



Pxz(i,k) =p xyz {i,0,k) +p xyz (i,l,k) = 9 x (i)^29 y \ x {j\i)9 z \ y (k\j) 

3=0 



for M 1 (0) is: 



(2) 




B(P XZ + u) 
B(J3 XZ ) 
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The marginal likelihood for Mq is: 



(3) 



L, 



o 



(0) 



B{f5 x + us) B{(3 Z + u z ) 
B(f3 x ) B(f3 z ) 



where and u z denote the marginal counts for X and Z respectively. 

To obtain the marg inal likelihood of M x (1) the prior density for p xz can be written 
explicitly using (1). Exact computations are technically inelegant but some recent devel- 
opments of Lin et al. (2009) make it possible to compute the marginal integrals exactly at 
least for simple mixture models. We note that simple Monte Carlo approximations of the 
integrals below give very good results as well. 

As an example consider the following table of counts: 



The exact Fisher's test in this case gives the value of the odds ratio 6.2 and the corre- 
sponding p-value is 0.012. Hence the data strongly support the alternative hypothesis. 
The Bayes factor for the first parametric formulation, denoted by BFo, is easily computed 
by dividing l[ 0) in (2) by l[, 0) in (3). For k = 1 we calculated scores using Lin et al. 
(2009) confirming these against good approximate results provided by simple Monte Carlo 
simulation. Initially assume that (3 = 4 and Tr X yz(hj,k) = 1/8 for all i,j, k = 0,1. This 
gives p xz a Dirichlet Dir(l, 1, 1, 1) (and hence uniform) distribution if k = (but not if 
k = 1) and: 



So in the second parametric formulation, when k = 1, the Bayes factor slightly underes- 
timates the evidence for the saturated model. The reason here is that the induced prior 
distribution on p y , where p y (j) = ^2ikPxyz(hj,k), is not uniform and equals Beta(2,2). 
This is an important point because it shows that the uniform prior n may lead to highly 
informative scenarios, which may then affect our analysis. 

The result of the analysis changes if we set (3 = 4 and itxyzih 0, k) = i/4 and 7r xyz (i, 1, k) = 
(1 — i)/4 for t < 0.5. Note that the induced distribution on p xz in the first parametric for- 
mulation is still uniform because TT xz (i, k) = 1/4 for all i,k = Q,l but the prior information 
on the distribution of the hidden variable is much stronger. This affects the distribution 
of p xz under the second parametric formulation. The induced distribution of p y is Beta 
with parameters (4(1 — t),4t). In particular, for t < 0.25 the corresponding density is not 
bounded giving increasingly more probability to the event p y (l) > 1 — e as t — > 0. Some 
Beta density functions for different values of t are given in Figure 5. 

If t is small then there is a strong a priori information that the inner vertex is close to 

being degenerate. Therefore, the Bayesian model is close to the model of indepen- 

dence. This should cause overestimation of the independence model. Indeed, if t = 0.2 
then for u given above we have: 



u = 



13 9 
4 18 



(4) 



BF = 11-47, BFi = 6.27. 



BF = 11-47, BFi = 2.68. 
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Comparison of Beta distributions 




0.0 0.2 0.4 0.6 0.8 1.0 



Figure 5. The densities of Beta distributions with parameters (4(1— t), 4t), 
where t = 0.5, 0.4, 0.3, 0.2, 0.1. 

This example illustrates a more serious problem. For each t, as above, we compute 
BFo(i) and BFi(i) obtaining the following result. 

Proposition 3.1. For every table u the Bayes Factor BFo(i) is constant and does not 
depend on t. Moreover, for every u, BFi(t) — > almost surely as t — > 0. 

This result may seem obvious. However, what it really shows is that, even though 
the induced priors on the joint distribution of (X, Z) follow from the same joint prior 
distribution ir on (X, Y, Z) , in the first case the Bayes factor may give evidence in favour 
of the saturated model and the second in favour of the model of independence. 

The situation becomes even more dramatic if k increases. Thus if k = 2 we set (3 = 4 and 
ftxi2z(i,j,k,l) = 1/16 for all i,j,k,l = 0,1 and again the induced prior on p xz for the case 
k = is uniform. Under this setting we find that BF2 = 2.13 which is to be compared with 
(4). This uniform case is easy to generalize and illustrates another serious issue related 
with the choice of the parametric formulation for the model under consideration. 
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For general k > 1, if /3 = 4 and 7r x i...fc 2 (i) = 2 ( fe + 2 ) for every i = (jo, . . . , i^+i) G 
{0, l} k+2 , then the induced prior distributions are 6 X ~ Beta(2, 2), 9i\ x (-\i), ■ ■ ■ , z \k(-\i) ~ 
Beta(l, 1) for i = 0, 1. 

Proposition 3.2. Let /3 = 4 and n xl ... kz (i) = 2~ { - k+ ^ for every i G {0, l} fc+2 . T/ien 
BF fc — » as k — >• oo. 

The proof is given in Section A.l. 

In the case of Proposition 3.1 we analysed the discrepancy of the model selection if the 
prior of the hidden variable becomes degenerate without affecting the prior over the ob- 
served variables. In Proposition 3.2 however the prior distribution of every hidden variable 
is symmetric and hence it shows a different aspect of the discussed problem. 

4. Asymptotic analysis 

Perhaps not surprisingly a similar ambiguity also arises if we use the popular BIC rather 
than Bayes Factor model selection method. Let q denote the true density of (X, Z), as- 
sumed strictly positive everywhere, and let (Xi, Z\), . . . , (X n , Z n ) be a random sample from 
this distribution. In this section we perform an asymptotic analysis. Let Z n denote the 
marginal likelihood function and S q the entropy function of q. We define F n = — logiJ n . 

In the natural parametrisation the asymptotic likelihood as n — > oo is always maximised 
over the unique point q. By the result of Schwarz (1978) the asymptotic formula for the 
marginal likelihood can be obtained using the Laplace approximation. Whenever the prior 
distributions are bounded and bounded away from zero then, as n — > oo, 

(5) EF n = nS q -^logn + 0(l), 

where d = 2 for the marginal likelihood given the model and d = 3 for model 
This asymptotic approximation of the marginal likelihood justifies the use of the well known 
BIC penalty used in routine model selection. If the true distribution lies in the null model 
then the entropies S q for both models will be asymptotically equal and the difference in 
scores BICo — BICi = ^\ogn will be always positive, which gives a positive evidence in 
favour of the null model. 

The interpretation of BIC in the presence of hidden variables is more subtle. Under 
the second proposed parametrisation k = 1 both models are unidentifiable hence the as- 
ymptotic analysis is much harder. In particular the Laplace approximation is no longer 
formally valid and the appropriate asymptotic analysis must use the results of singular 
learning theory developed by Watanabe (2009). 

(k) 

We now compute the marginal likelihood of the data under the parametric model M] 
for k > 1. The asymptotic formula depends on the true distribution q generating the data. 

(k) (k) 

If q G M{ ' \ ' for k > 1 then, despite the identifiability issue, the correct asymptotic 
approximations can be shown to be equal to the classical BIC formula in the case when 

(k) 

k = 0. The problem occurs when q G Mq . In this case the set of parameters mapping to 
q is highly singular. Whenever the prior distribution is bounded and bounded away from 
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zero on the whole parameter space, the asymptotic approximation of ¥,F n for the model 
m[ , as n — > oo, is 

Ep = \nS q -l logn + O(l) if q G Af{*> \ M (fc) , 

" \ nS q - flogn + Hoglogn + 0(l) ifgeM^. 

The proof of (6) is given in Section A. 2. 

(k) (0) 

The marginal likelihood of Mq is equal to the marginal likelihood of Mq ' and thus in 
this case 

EF n = nS q -logn + 0{l). 
Now we see that a problem might occur if the true data generating distribution lies in the 

(k) (k) 

independence model. The bigger k, the harder it gets to distinguish Mq from Mj . Since 
the entropy value will be asymptotically the same in both cases, the difference in scores 
between those two models is 

3 1 

(— log n) — (— — log n + k log log n) = — log n — k log log n. 

Since the true model is the model of independence we expect this difference to be highly 
positive like in the case when k = 0. However, the component of the penalty k log log n 
distorts this as is shown in Figure 6. The score difference remains negative whenever k > 2 
even for very large n. Hence, for all usual values of the sample size n the use of the standard 
BIC criterion underestimates the evidence of the null model whenever k > 1. 

5. Discussion and conclusions 

In this paper we have shown that we need a foundation for justifying a particular embed- 
ding before BF or BIC model selection is unambiguous for binary tree models. This fact 
can be shown also to apply to all model selection over discrete graphical models, Gaussian 
graphical models with potential hidden variables and, in particular, Bayesian hierarchi- 
cal models, where systematically hidden variables are routinely added to the system to 
articulate various types of dependence structures. Problems are particularly acute when 
the dimension of the embedding class is itself contextualy ambiguous as in our running 
example. For all these models whenever the appropriate embedding is not transparent, we 
should be aware this choice could be critical to the result of our selection. In particular the 
current praxis of paying little attention to the different inferential implications of a chosen 
embedding, focusing instead on the numerical efficacy of a particular representation is one 
that should be of great concern to Bayesian modelers. We note that the problems we 
identify here do not concern just BF or BIC: other Bayesian model selection methods also 
suffer the same difficulty 

What can we do to address this issue? First, if there exists an associated meaning 
to a given embedding then we could elicit a prior distribution for each model and then 
average over this. In our example we could for example elicit the number of potential 
differently evolved ancestors for which only one direct descendant survived. However, as in 
our example, it may well be difficult to unambiguously make this association and is certainly 
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n 

Figure 6. The difference in scores | log n—k log log n depicted as a function 
of n for different values of k = 1, 2, 3, 4, 5. The n axis is in the log scale. 

not in the spirit of current model selection, which tries fot the sake of "objectivity" to avoid 
the incorporation of as little domain knowledge as possible: including much more direct 
domain knowledge than this. A second possible direction is to systematically check the 
plausibility of a particular embedding on the associated marginal likelihood of different 
models to check how the system will learn in various contingencies and calibrate to that. 

Finally we could question, as some others do e.g. Dennis Lindley, Draper (1999) whether 
model selection is actually compatible with Bayesian methodology at all. It would be sad 
however to discard these techniques which have undoubtedly provided such useful output to 
scientists endeavouring to understand the processes underpinning their observations. But 
at least when using Bayesian model selection techniques in conjunction with apparently 
innocuous homeneiety assumptions like, in the case of BN model selection, parameter 
modularity we should be aware that the associated inferences could seriously mislead the 
investigator. 
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(k) 

A.l. Proof of Proposition 3.2. Since the marginal likelihood Lq for the null model 

(k) 

does not change with k, it suffices to show that the marginal likelihood L\ of the model 
Ad[ k) converges to zero as k -)• oo. Let = (9 X , 9 l][X , 2 |i, • • , z \k) G [0, l] 1+2 ( fc+1 ) and 

= II \0x(i) E 9 i\x(h\i)O 2 \i(i2\k)---0k\k-i(ik\ik~i)O z \k(j\ik) 

i,j=0 \ h,—,ik 

Since 9 X ~ Beta(2, 2), 9i\ x (-\i), . . . , 9 z \k(-\i) ~ Beta(l, 1) for i = 0,1, we have 
(7) L[ k) = ^ [ H6)\{9 x {i)&e. 

r ( 2 ) Jf0.11 1 + 2 ( fc + 1 ) rn 



i=0 



Let ^ it .. ife = 6>i|a;(ii|i)02|i(*2|*i) • • • 0fc|ft-i(*fcKfc--i) for i, ix, . . . , i k = 0, 1. Note that tp(x) = 
x a is a convex function on [0, oo) whenever a > 1 or a = 0. Since YV. ■ t) „• = 1 for 
i = 0, 1 and either Uy > 1 or = 0, by Jensen's inequality 

Ui) E 4-^^ifc0"|i*)) < E 4-i ft (^«^| fc (ii^))^ 

and hence 

l 

(8) /jfe(0) < J] J] ^('il^lifcl'il-^N^^il^W^O'liif ■ 

h,...,ik *,i=o 

We have 

/ II ^N^, = / 0i|«(ii|O) 2 » ljas (ii|l) a d0 1 | a; (l|l)d6> 1 [ a ,(l|O) = f ^) 

■Wijib Vl 2 \ r ( 4 )/ 

and for I = 2, . . . , k 

J [Q1] II ^|i-i(»tl*i-i)de i |i_i(l|*i_i) = ^^^u-iCiilit-i^d^^iCllij-i) = ■ 
This together with (8) gives 

f l )■■■}'«; 

where C(u) is a constant which depends only on the table of counts u. This in particular 

(k) 

implies that Li ; — ^ as k — > oo. 
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A. 2. Proof of Equation (6). The set of parameters of the model is the marginal dis- 
tribution p x of X, the conditional distribution of Y\ given X = i denoted by 9\{-\i), the 
conditional distribution of Z given Yk = i, denoted by 9 z (-\i) and conditional distributions 
of Yj given Yj-± = i for every j = 2, . . . , k denoted by 9j(-\i). Hence the parameter vector 
9 lies in [0, l] 1 + 2 ( fc + 1 ). We have 

Pxz(i,k) =p x {i) ^2 1 (j 1 \i)9 2 (j2\ji)---0k(jk\jk-i)0z(k\j k ). 

h,—,jh 

Since the true distribution is assumed to lie in the independence model we cannot use 
the standard Laplace approximation for the marginal likelihood because the asymptotic 
likelihood is maximised over a singular subset of the parameter space. Assume that the 
prior distribution on O is bounded and bounded away from zero. Then by The Corollary 
6.1 of Watanabe (2009) the marginal likelihood is asymptotically, as n — > oo, approximated 
by 

nS — Alogn + (m — 1) log log n + 0(1), 
where A and m are the smallest pole and its multiplicity of an analytic function given by 

(9) ew = / (f(o)r w de, 

Je 

where f(9) is the Kullback-Leibler divergence from the true model q. By Theorem 1.2 of 
Lin (2011) we can also replace f(9) with the sum of squares J2i j=o(Pij(@) ~ Qij) 2 - 
Let fi x (9) =pio(0)+p u {e), n z {0) = Pqi{0) + p n {0) and 

H*z{0) = Pu(0) - (j?io(0)+Pn(d))(Poi(O)+Pu(9)). 

It is immediate to see that /j, x , p z , fi xz are in one-to-one correspondence with \pij\- Moreover, 
fJ-xz is just the covariance between X and Z and hence it is zero if and only if X ALZ. Since 
the pole of (9) and its multiplicity do not change under isomorphisms, we can further 
replace the function Yli,j((Pij(@) ~ %') 2 w ith 

(10) (^(0) - /4) 2 + ( M *(0) - p* z f + (iu xz (9) - fi* xz ) 2 . 

The asterisks refer to the moments of the true distribution q. Note that since q lies in the 
independence model then \i xz = 0. 

Now make an isomorphic change of coordinates of 9 to parameters 

w = {^xi /^j/i ) • • • i Hy k i [i-z'i Vxy 1 > %ij/2 ! • ■ • > Vytz) 

where fj, Vi is the mean of Y^ and r) xy = ¥(Y = 1\X = 1) — ¥(Y = 1\X = 0) is the linear 
regression coefficient of Y with respect to X. Since 

Vxz(u) = |(1 - nl)Vx yi ■ ■ ■ riy k z 
then the function in (10) expressed in new parameters becomes 

(fi x - nlf + (fi z - tj* z f + (-(1 - nl)rj xyi ■ ■ ■ 7] ykZ ) 2 . 
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By Remark 7.2 of Watanabe (2009), if oj* is an interior point of the parameter space, 
A = 1 + A' and m = ml where (A ,m') are the smallest pole and its multiplicity of the 
analytic function of w given by 

/{Vxyi ' ' ' Vykz) ^Vxyi • • • drjy kZ , 
L -e,e]*+! 

for a sufficiently small e > 0. Finally, this integral is equal to C(e)( 1 _ 1 2M) ) fc+1 , where C(e) 
is a constant which depends only on e. It follows that the pole of this function is A' = 1 /2 
and the multiplicity of this pole is m' = k + 1 and hence A = 3/2 and m = k + 1. 

Note that to use Remark 7.2 of Watanabe (2009) we assume that the parameter space 
has locally a product structure where fj, x and fi z are independent of other parameters. This 
holds only in the interior of the parameter space so the boundary points u* need to be 
checked separately. We omit the details. 
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